History and major milestones in bioinformatics


Prerequisites: None.
Level: Beginner.
Learning objectives: Basic understanding of how bioinformatics has developed over time.

Major milestones in bioinformatics

Bioinformatics is a rapidly growing field. It combines biology, computer science, and information technology to analyze and interpret biological data. Over the past few decades, numerous milestones in bioinformatics have significantly impacted our understanding of biology and the development of new therapies and treatments.

  • 1965: Margaret Dayhoff developed the first protein sequence database, which was called the Atlas of Protein Sequence and Structure. This was a major step towards understanding the relationship between protein structure and function.
  • 1970: Saul B. Needleman and Christian D. Wunsch published the first sequence alignment method to align and compare protein and nucleotide sequences.
  • 1971: The Protein Data Bank (now the RCSB Protein Data Bank), a repository for three-dimensional structures of biological macromolecules, was established.
  • 1977: Frederick Sanger developed a rapid method for determining the base sequence of DNA. This chain-termination method made large-scale DNA sequencing practical and paved the way for the Human Genome Project.
  • 1981: Temple Smith and Michael Waterman published the Smith-Waterman sequence alignment algorithm, which identifies regions of local similarity that may indicate functional, structural, or evolutionary relationships between two sequences.
  • 1982: GenBank, a database of nucleotide sequences, was established by the National Institutes of Health (NIH) to store and share genetic information.
  • 1984: The PIR-International Protein Sequence Database was established.
  • 1986: SWISS-PROT, a curated protein sequence database containing information about protein sequences, functions, and structures, was created.
  • 1990: The Human Genome Project was launched. This ambitious project aimed to sequence the entire human genome, and it was completed in 2003.
  • Late 1990s and early 2000s: The field of metagenomics was established. This field focuses on studying the genetic material of entire microbial communities, rather than just individual organisms.
  • 2001: The first draft of the human genome was published. This was a major breakthrough in our understanding of human biology, and it opened up new avenues for research and drug development.
  • 2002: The UniProt protein sequence database was formed by uniting Swiss-Prot, TrEMBL, and PIR-PSD.
  • 2010: The first synthetic genome was created. This was a landmark achievement in the field of synthetic biology, and it paved the way for the creation of new organisms with custom-designed genomes.
  • 2012: Jennifer Doudna and Emmanuelle Charpentier showed that the CRISPR-Cas9 system could be programmed to edit genomes with unprecedented precision and accuracy.
  • 2023: The integration of artificial intelligence (AI) and machine learning (ML) into bioinformatics tools and workflows is revolutionizing the field. AI and ML are being used to analyze large datasets, predict protein structures, and develop new drugs.

The Genome

Let us start with the concept of the genome. The concept of the genome has been fundamental in genomics, providing a framework for understanding the genetic basis of life and the relationship between genetics and disease. German botanist Hans Winkler coined the term "genome" in 1920 to describe the set of chromosomes that specify the genetic makeup of a species. Winkler had a keen interest in the genetic mechanisms of heredity and the relationship between chromosome number and parthenogenesis. He previously used related terms like "pangene" and "genotype" to refer to the genetic material of a species.

The term "genome" was used occasionally in the decades following its introduction by Hans Winkler, but by 1970 it had been widely adopted by the scientific community. Its usage expanded broadly in the following decade. In 1986, the Canadian Journal of Genetics and Cytology changed its name to Genome, marking a new period in its history. In the Genome journal's wake, several other journals emerged in the early 1990s, including Human Genome Review, Mammalian Genome, and the International Journal of Genome Research.

The real revolution came in October 1990, when the International Human Genome Sequencing Consortium (IHGSC) launched the Human Genome Project (HGP). The project had been outlined in 1988 by a special U.S. National Academy of Sciences committee charged with planning the sequencing of the entire human genome, and the undertaking involved scientists and researchers from multiple countries. About five years after the launch, in 1995, J. Craig Venter's group released the first complete genomes of two pathogens, Haemophilus influenzae and Mycoplasma genitalium. These achievements marked the birth of modern genomics and led to an explosion of new journals; the term "genome" became a buzzword even among the general public.


Genome sequencing is not bioinformatics per se; however, the raw sequence data from the wet lab require sophisticated computational algorithms to be transformed into coherent, contiguous, and human-readable form. Furthermore, the enormous amounts of sequencing data need appropriate storage, along with ways to search the databases and compare sequences in further studies.

The Sequences


Frederick Sanger wrote about his conviction that "...a knowledge of sequences could contribute much to our understanding of living matter" (Sanger F. 1980). Already in the early 1950s, Sanger had used his Nobel Prize-winning method to sequence the two chains of the insulin molecule, a protein.

After the discovery of DNA's structure in 1953, many scientists put great effort into determining the nucleotide order of DNA. More than two decades later, in 1977, Sanger managed to sequence the genome of bacteriophage phi X174 (5,375 nucleotides) using his initial "plus and minus" method. You can read more about the discovery of the DNA structure in our history section, "Is it Protein or DNA that carries genetic information?"

Maxam and Gilbert published their chemical sequencing method in the same year, and at the end of 1977 Sanger published his second, chain-termination method, which initially produced read lengths of about 100 nucleotides.

Both Gilbert and Sanger shared the Nobel Prize in Chemistry in 1980 for their work. Over time, however, the Sanger sequencing method became more widely used, in part because the Maxam-Gilbert method required radioactive labeling.

Computational tools and algorithms

In the years following Sanger's invention of the protein sequencing method, many scientists sequenced proteins, resulting in an ever-growing number of sequences. However, appropriate computer-based analysis methods for protein sequences were still unavailable in the 1950s and 1960s.

Computers were still expensive, occupied whole rooms, and had to be programmed in assembly language, which was tedious work. John Backus and his team at IBM published the FORTRAN programming language in 1957; the language is still evolving, and NASA, for example, uses it for mission-critical implementations.

[Image: ENIAC (Electronic Numerical Integrator And Computer) in Philadelphia, Pennsylvania.]

Several other programming languages followed, such as LISP, COBOL, and Pascal. In the 1970s, Dennis Ritchie and his colleagues at Bell Labs created the C programming language; notably, they also rewrote the UNIX code in C.

The early pioneers of the 1960s were Margaret Dayhoff and Richard Eck. They compiled all known protein sequences from the published literature and stored them on punch cards in a computer for further analyses. Specifically, they estimated how many amino acids changed over a given time period and used this information to construct a phylogenetic tree depicting the ancestral relationships of proteins.

Their work resulted in a model of evolutionary change in proteins in the form of the PAM matrices, published in 1978. Moreover, Margaret Dayhoff and Richard Eck presented the one-letter code for amino acids, which, with some modifications to the original, is the standard today.
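
The PAM matrices are still distributed with common sequence-analysis libraries. As a minimal sketch (assuming Biopython is installed; the library calls below are illustrative and not part of the original account), the Dayhoff PAM250 scores can be looked up by one-letter amino acid codes:

    # Minimal sketch, assuming Biopython is available: look up Dayhoff-style
    # PAM250 substitution scores using one-letter amino acid codes.
    from Bio.Align import substitution_matrices

    pam250 = substitution_matrices.load("PAM250")
    print(pam250["W", "W"])   # conserved tryptophan: high score
    print(pam250["W", "G"])   # unlikely tryptophan-to-glycine change: low score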

[Image: Making punch cards.]

In 1970, Saul B. Needleman and Christian D. Wunsch published the first sequence alignment method for aligning and comparing protein and nucleotide sequences. They based their approach on dynamic programming, a method the mathematician Richard E. Bellman had developed already in 1953. With the Needleman-Wunsch method, scientists could now adequately compare sequences using a computer. The Needleman-Wunsch method is also called the "global sequence alignment" algorithm.
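
A minimal, illustrative Python sketch of the Needleman-Wunsch recurrence is shown below. The scoring values (match, mismatch, gap) are arbitrary choices for the example and are not taken from the original publication; the sketch returns only the optimal score, not the traceback.

    # Minimal sketch of Needleman-Wunsch global alignment (dynamic programming).
    # Scoring values here are illustrative, not from the original 1970 paper.
    def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
        """Return the optimal global alignment score of sequences a and b."""
        rows, cols = len(a) + 1, len(b) + 1
        # F[i][j] holds the best score for aligning a[:i] with b[:j].
        F = [[0] * cols for _ in range(rows)]
        for i in range(1, rows):
            F[i][0] = F[i - 1][0] + gap          # a[:i] aligned against gaps
        for j in range(1, cols):
            F[0][j] = F[0][j - 1] + gap          # b[:j] aligned against gaps
        for i in range(1, rows):
            for j in range(1, cols):
                diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                F[i][j] = max(diag,               # align a[i-1] with b[j-1]
                              F[i - 1][j] + gap,  # gap in b
                              F[i][j - 1] + gap)  # gap in a
        return F[-1][-1]

    print(needleman_wunsch("GATTACA", "GCATGCU"))   # prints the optimal score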

In 1981, Temple F. Smith and Michael S. Waterman published the Smith-Waterman sequence alignment algorithm, a dynamic programming algorithm used to align two nucleotide or amino acid sequences. It is instrumental in identifying regions of similarity that may indicate functional, structural, or evolutionary relationships between two sequences.

The algorithm is significant because it can locate local alignments that other alignment methods may miss, and it can also identify suboptimal alignments that may be important for understanding the evolutionary history of a set of sequences. It has been widely used in bioinformatics for tasks such as sequence annotation, genome assembly, and protein structure prediction. Its impact on the field has been substantial, and it remains a fundamental tool for many bioinformatics applications.
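
For comparison, a minimal sketch of the Smith-Waterman recurrence is given below; the essential differences from the global version are that cell scores are floored at zero and that the best score may occur anywhere in the matrix. Again, the scoring values are illustrative only.

    # Minimal sketch of Smith-Waterman local alignment.
    # Differences from Needleman-Wunsch: scores are floored at zero, and the
    # result is the maximum anywhere in the matrix, not the bottom-right cell.
    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        """Return the best local alignment score between sequences a and b."""
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]     # first row/column stay zero
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman("TGTTACGG", "GGTTGACTA"))   # prints the best local score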

The Human Genome Project


The Human Genome Project (HGP) was a massive international research effort launched in 1990 to map and sequence the entire human genome; scientists from at least 20 countries participated. The project was a response to the growing understanding of the importance of genetics in human health and disease. The idea of mapping the human genome had been discussed for years, but it was the development of new technologies in the 1980s that made the project feasible. The HGP was completed in 2003, and its results have profoundly impacted our understanding of genetics and human health.

Metagenomics


The field of metagenomics was established in the late 1990s and early 2000s, building on the development of DNA sequencing technologies. Traditional microbiology techniques involved isolating and culturing single bacterial strains to study them. However, it became apparent that the vast majority of bacteria in natural environments could not be cultured in the lab, making it difficult to study the diversity and functionality of these communities.

Metagenomics addressed this challenge by using DNA sequencing to study entire microbial communities directly from environmental samples, allowing researchers to analyze the genetic material of all the microorganisms in a given sample, regardless of whether they were culturable. The development of metagenomics led to a revolution in our understanding of microbial diversity and function in natural environments. It opened up new avenues of research in fields such as environmental microbiology and biotechnology.

An important early milestone in the field of metagenomics was the publication of a study in 1998 that used PCR to amplify and sequence ribosomal RNA genes from uncultured microorganisms in an environmental sample. This study demonstrated the potential of metagenomics and paved the way for further research.

Another significant milestone was the publication of a study in 2004 that used shotgun sequencing to analyze the microbial community in the Sargasso Sea. This study provided one of the first comprehensive analyses of a complex microbial community using metagenomics, and it revealed a much higher level of diversity than was previously appreciated.

Metagenomics continues to be a rapidly evolving field, with new technologies and methods continually being developed. The field has essential applications in areas such as environmental monitoring, biotechnology, and human health, and it is likely to remain a critical research area for many years to come.

Metagenomics has recently been used to study microbial communities in various environments, including soil, oceans, and the human gut. By analyzing these communities' genetic material, researchers have gained new insights into their roles in various ecological processes, such as nutrient cycling and disease transmission.

The development of high-throughput sequencing technologies has also enabled researchers to analyze large amounts of metagenomic data in a short amount of time. High-throughput sequencing has led to the development of new computational tools and algorithms for analyzing and interpreting this data. It has allowed researchers to explore the relationships between microbial taxa and their functions within a community.

One of the current challenges in metagenomics is the development of standardized methods for sample collection, storage, and analysis. Because metagenomics is a relatively young field, much remains to be learned about best practices for studying microbial communities in different environments. As the field evolves, however, new technologies and methods are likely to emerge to address these challenges.

Overall, the establishment of the metagenomics field has significantly impacted our understanding of microbial diversity and function in natural environments. It has opened up new avenues of research and has the potential to lead to new discoveries in fields such as environmental microbiology, biotechnology, and human health.

The Databases


The RCSB Protein Data Bank is a resource that provides access to information about the three-dimensional structures of proteins, nucleic acids, and complex assemblies. It was established in 1971 as a repository for structural data and has since grown to contain over 150,000 structures. The database provides a valuable resource for researchers studying the structure and function of biological macromolecules and has played a vital role in advancing the field of structural biology. In recent years, the database has also begun incorporating drug discovery and design data, making it an essential tool for the pharmaceutical industry.

GenBank, a database of nucleotide sequences, was created in 1982 to store and share genetic information. The database was started by Walter Goad at Los Alamos National Laboratory and is funded and maintained through the National Institutes of Health (NIH). Today it contains millions of sequences from a wide range of organisms and has been an essential tool for researchers in the field of bioinformatics.

GenBank has undergone significant changes since its creation. In the early days, the database was maintained manually, with researchers submitting their sequences on paper forms. However, as the number of submissions grew, this approach became impractical.

In 1986, the NIH began accepting electronic submissions; by 1988, the entire GenBank database was available in electronic form. The database continued to grow in the following years, and new features were added to make it easier to search and analyze the data.

In 1992, GenBank was made available over the internet, which made it accessible to researchers all over the world. Access over the internet was a significant milestone in the history of bioinformatics, as it allowed researchers to share and access genetic information more efficiently than ever before.

Since then, GenBank has continued to evolve, with new data types being added and new tools being developed to analyze the data. Today, it remains one of the most important resources for researchers in bioinformatics, and it continues to play a critical role in advancing our understanding of genetics and genomics.

The PIR-International Protein Sequence Database, established in 1984, was one of the earliest protein sequence databases. It was an important milestone in bioinformatics, allowing researchers to analyze and compare protein sequences on a large scale. The database was later incorporated into the UniProt Knowledgebase, which is now one of the world's most widely used protein sequence databases.

UniProt, which stands for Universal Protein Resource, is a comprehensive protein database created in 2002 by merging three separate databases: Swiss-Prot, TrEMBL, and PIR-PSD. Swiss-Prot was created as a protein sequence database in 1986 by Amos Bairoch and his team at the University of Geneva, and TrEMBL was added in 1996 as a computer-annotated supplement to Swiss-Prot. These two resources were later merged with the PIR-PSD database to create UniProt.

Today, UniProt is one of the largest protein databases in the world, containing information on millions of proteins from a wide range of species. Researchers in the field of bioinformatics widely use it for a variety of applications, including protein identification, characterization, and annotation. UniProt also provides many tools and resources to help researchers analyze and interpret protein data, making it an invaluable resource for the scientific community.

The Synthetic Genomes


The first synthetic genome was created in 2010 by J. Craig Venter Institute researchers. It was a synthetic version of a bacterium's genome inserted into another bacterium, effectively creating a new, synthetic organism. This achievement was a significant milestone in bioinformatics, as it allowed for the creation of new organisms with tailored genetic traits. Before the creation of the synthetic genome, genetic engineering was limited to modifying existing genes within an organism.

While the creation of the first synthetic genome was a significant achievement, it also raised ethical concerns about the creation of life in the lab. Critics argued that the technology could be used to create dangerous organisms or to engineer organisms that could be used as weapons. Therefore, the development of new tools and techniques in the field of genetic engineering should be approached with caution and ethical considerations.

Despite these concerns, the creation of the first synthetic genome paved the way for further research in synthetic biology. It has led to the development of new techniques for genetic engineering. Today, researchers continue to explore the possibilities of genetic engineering, and the development of new tools and methods will likely open up new frontiers in this field.

The CRISPR-Cas9 system


The CRISPR-Cas9 system is a relatively recent development in bioinformatics and genetic engineering. Its use as a programmable genome-editing tool was first demonstrated in 2012 by researchers from the University of California, Berkeley, and the University of Vienna, who studied how bacteria defend themselves against viral infections.

The CRISPR-Cas9 system is a powerful tool for genetic engineering, allowing researchers to edit genes selectively with a high degree of precision. The system uses a guide RNA to target a specific DNA sequence and an enzyme called Cas9 to cut the DNA at that location. The cell's repair mechanisms can then repair this cut, allowing researchers to add or remove specific genes as desired.
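
As a toy illustration of this targeting logic (a sketch only, not a laboratory or guide-design protocol), the short Python example below scans a DNA string for a hypothetical 20-nucleotide guide sequence followed by an NGG PAM motif, the pattern recognized by the commonly used Cas9 from Streptococcus pyogenes; both sequences are made up.

    # Toy sketch: find positions where a hypothetical 20-nt guide sequence is
    # followed by an NGG PAM motif (recognized by Streptococcus pyogenes Cas9).
    # The guide and DNA below are made-up sequences for illustration only.
    import re

    def find_cas9_targets(dna, guide):
        """Return 0-based start positions where guide + NGG occurs in dna."""
        pattern = re.compile(re.escape(guide) + r"[ACGT]GG")
        return [m.start() for m in pattern.finditer(dna)]

    guide = "GCGATTACAGGCATGCAAGG"            # hypothetical 20-nt guide (as DNA)
    dna = "TT" + guide + "TGG" + "AAACCT"     # made-up target with a TGG PAM
    print(find_cas9_targets(dna, guide))      # -> [2]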

The discovery of the CRISPR-Cas9 system has been hailed as a breakthrough in genetic engineering. It has already been used in various applications, from developing disease-resistant crops to treating human genetic disorders.

However, the development of the CRISPR-Cas9 system did not occur in isolation but instead built on a long history of research in genetic engineering. The ability to manipulate genes in living organisms was first demonstrated in the 1970s when researchers developed the techniques of recombinant DNA technology, which allowed researchers to cut and splice genes from one organism into another, opening up new possibilities for genetic engineering.

Since then, researchers have developed a wide range of tools and techniques for genetic engineering, including gene knockouts, gene silencing, and gene editing. The development of these techniques has been driven by the need to understand the genetic basis of disease and to develop new treatments and therapies.

The discovery of the CRISPR-Cas9 system represents a significant milestone in genetic engineering, as it offers a powerful new tool for selectively editing genes with a high degree of precision. As the use of this technology continues to grow, it will likely have a significant impact on fields such as agriculture, medicine, and biotechnology.

To understand the historical context of CRISPR-Cas9, it is worth noting that the characteristic repeated sequences were first observed in bacteria in 1987. The system consists of clustered regularly interspaced short palindromic repeats (CRISPRs) and associated (Cas) proteins, which were later shown to function as a primitive immune system in bacteria.

In the 2000s, researchers investigated the CRISPR system further and showed that it functions as an adaptive defense against viral infection. In 2012, the demonstration by Jennifer Doudna and Emmanuelle Charpentier that CRISPR-Cas9 could be programmed to cut specific DNA sequences opened up new possibilities for genetic engineering by providing a simple and effective way to edit genes.

Since then, the CRISPR-Cas9 system has been extensively studied and refined, and new variants of the system have been developed that offer even greater precision and control over gene editing. The development of the CRISPR-Cas9 system has also led to the emergence of new ethical and regulatory challenges, as the technology has the potential to be used for both beneficial and harmful purposes.

The integration of artificial intelligence (AI) and machine learning (ML) into bioinformatics


Integrating artificial intelligence (AI) and machine learning (ML) into bioinformatics tools and workflows is revolutionizing the field. These technologies are helping to address some of the most pressing challenges in bioinformatics, including the analysis of large and complex data sets, the identification of patterns and relationships within data, and the prediction of biological outcomes.

One of the key areas where AI and ML are applied in bioinformatics is the analysis of genomic data. The human genome contains more than 3 billion base pairs, and analyzing this vast amount of data is a significant challenge for researchers. AI and ML algorithms can help to identify patterns and relationships within the data, allowing researchers to better understand the underlying biology.

Another area where AI and ML are being applied is drug discovery. Developing new drugs is time-consuming and expensive, with a high failure rate. AI and ML algorithms can help to identify potential drug targets and predict the efficacy of new drugs, helping to speed up the drug discovery process and reduce costs.

AI and ML are also being used to develop new tools and workflows for bioinformatics analysis. For example, machine learning algorithms can classify different types of biological data, such as gene expression data or protein structures, automating the analysis process and reducing the need for manual intervention.
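
As a minimal sketch of what such a classification workflow can look like (assuming scikit-learn is available; the data are randomly generated stand-ins for gene expression profiles, and the model choice is illustrative, not a recommendation):

    # Minimal sketch: classify synthetic "gene expression" profiles with a
    # random forest (scikit-learn assumed installed; data are random stand-ins).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 500))                  # 200 samples x 500 "genes"
    y = (X[:, :10].mean(axis=1) > 0).astype(int)     # label driven by 10 "genes"

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))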

Despite the many benefits of AI and ML in bioinformatics, some challenges must be addressed. One of the main challenges is the need for high-quality data to train these algorithms; this requires extensive and diverse data sets, which can be difficult to obtain in some cases. There is also a need for standardized data collection and analysis methods to ensure that the results are accurate and reproducible.

Integrating AI and ML in bioinformatics has already led to significant breakthroughs. For example, researchers at the University of California, San Francisco, used machine learning algorithms to analyze genomic data from more than 10,000 tumors. The algorithms were able to identify patterns that could predict which tumors were likely to respond to a particular type of immunotherapy, improving patient outcomes and reducing the need for expensive and time-consuming clinical trials.

Another example of the potential of AI and ML in bioinformatics is the development of deep learning algorithms for predicting protein structures. Proteins are the building blocks of life and play a critical role in many biological processes. However, predicting the structure of a protein from its amino acid sequence is a complex and challenging task. Deep learning algorithms have shown promise in this area, with researchers at the University of Washington developing a deep learning algorithm that can predict the structure of a protein with near-atomic accuracy.

In addition to these examples, there are many other areas where AI and ML are being applied in bioinformatics, including the analysis of single-cell data, the prediction of drug toxicity, and the identification of disease-causing mutations.

As with any new technology, there are concerns about using AI and ML in bioinformatics. One concern is the potential for these algorithms to reinforce existing biases in the data, leading to inaccurate or discriminatory results. Another concern is the need to ensure that these algorithms are transparent and explainable so that researchers can understand how they make predictions and decisions.

Researchers are developing new methods for training and testing AI and ML algorithms in bioinformatics to address the above concerns, including creating more diverse and representative data sets and new strategies for evaluating the accuracy and fairness of these algorithms.

In drug discovery, AI and ML algorithms have already been used to identify new drug candidates, predict the efficacy of drugs, and reduce the risk of adverse side effects. These algorithms can analyze large data sets of drug-target interactions, drug structures, and molecular properties to identify potential drug candidates and help pharmaceutical companies to streamline the drug discovery process, reducing the time and cost of developing new drugs.

AI and ML algorithms are also being used to develop personalized medicine, which involves tailoring treatments to individual patients based on their genetic makeup. Using genomic data, AI and ML algorithms can identify genetic markers associated with a disease or condition. This information can then be used to develop personalized treatment plans, improving patient outcomes and reducing healthcare costs.

Another area where AI and ML are applied in bioinformatics is the development of predictive models for disease diagnosis and prognosis. These models can analyze patient data, such as genomic data, medical history, and clinical data, to predict the likelihood of developing a particular disease, its severity, and the likely response to treatment.


References

  1. Melania E. Cristescu. The concept of genome after one century of usage. Genome. 62(10): iii-v. https://doi.org/10.1139/gen-2019-0129
  2. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995 Jul 28;269(5223):496-512. https://doi.org/10.1126/science.7542800. PMID: 7542800.
  3. Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, Fleischmann RD, Bult CJ, Kerlavage AR, Sutton G, Kelley JM, Fritchman RD, Weidman JF, Small KV, Sandusky M, Fuhrmann J, Nguyen D, Utterback TR, Saudek DM, Phillips CA, Merrick JM, Tomb JF, Dougherty BA, Bott KF, Hu PC, Lucier TS, Peterson SN, Smith HO, Hutchison CA 3rd, Venter JC. The minimal gene complement of Mycoplasma genitalium. Science. 1995 Oct 20;270(5235):397-403. https://doi.org/10.1126/science.270.5235.397. PMID: 7569993.
  4. Heather JM, Chain B. The sequence of sequencers: The history of sequencing DNA. Genomics. 2016 Jan;107(1):1-8. https://doi.org/10.1016/j.ygeno.2015.11.003. Epub 2015 Nov 10. PMID: 26554401; PMCID: PMC4727787.
  5. Sanger F. 1980. Frederick Sanger — Biographical. https://www.nobelprize.org/prizes/chemistry/1980/sanger/biographical/
  6. Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975 May 25;94(3):441-8. https://doi.org/10.1016/0022-2836(75)90213-2. PMC.
  7. A M Maxam and W Gilbert. A new method for sequencing DNA. Proc Natl Acad Sci U S A. 1977 Feb; 74(2): 560–564. https://doi.org/10.1073/pnas.74.2.560. PMC.
  8. F. Sanger, S. Nicklen, and A. R. Coulson DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977 Dec; 74(12): 5463–5467. https://doi.org/10.1073/pnas.74.12.5463. PMC.
  9. Thomas Clune, Computational & Information Sciences & Technology Office, NASA Goddard Space Flight Center April 28, 2015. (NASA and the Future of Fortran) https://www.nas.nasa.gov/publications/ams/2015/04-28-15.html#:~:text=The Fortran programming language remains,role in many NASA projects.
  10. A Model of Evolutionary Change in Proteins. M.O. Dayhoff, R.M. Schwartz, and B.C. Orcutt. 1978. In: Atlas of Protein Sequence and Structure. M.O. Dayhoff, ed., pp. 345–352. National Biomedical Research Foundation, Washington, DC. http://chagall.med.cornell.edu/BioinfoCourse/PDFs/Lecture2/Dayhoff1978.pdf
  11. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970 Mar;48(3):443-53. https://doi.org/10.1016/0022-2836(70)90057-4. PMID: 5420325.
  12. Robert S. Roth, ed. (1986). The Bellman Continuum: A Collection of the Works of Richard E. Bellman. World Scientific. p. 4. ISBN 9789971500900.
  13. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981 Mar 25;147(1):195-7. doi: 10.1016/0022-2836(81)90087-5. PMID: 7265238.
  14. Piekarowicz A. Werner Arber, Daniel Nathans i Hamilton Smith. Nagrody Nobla za badania nad enzymami restrykcyjnymi [Werner Arber, Daniel Nathans and Hamilton Smith. Nobel prizes for the studies on DNA restriction enzymes]. Postepy Biochem. 1979;25(2):251-3. Polish. PMID: 388391.
  15. Konforti B. History. The servant with the scissors. Nat Struct Biol. 2000 Feb;7(2):99-100. doi: 10.1038/72469. PMID: 10655607.
  16. Roberts RJ. Restriction endonucleases. CRC Crit Rev Biochem. 1976 Nov;4(2):123-64. doi: 10.3109/10409237609105456. PMID: 795607.
  17. Pingoud A, Wilson GG, Wende W. Type II restriction endonucleases--a historical perspective and more. Nucleic Acids Res. 2014 Jul;42(12):7489-527. doi: 10.1093/nar/gku447. Epub 2014 May 30. Erratum in: Nucleic Acids Res. 2016 Sep 19;44(16):8011. PMID: 24878924; PMCID: PMC4081073.
  18. Arber W, Linn S. DNA modification and restriction. Annu Rev Biochem. 1969;38:467-500. doi: 10.1146/annurev.bi.38.070169.002343. PMID: 4897066.
  19. Lander ES, Linton LM, Birren B, et al.; International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001 Feb 15;409(6822):860-921. doi: 10.1038/35057062. Erratum in: Nature 2001 Aug 2;412(6846):565. Erratum in: Nature 2001 Jun 7;411(6838):720. Szustakowki, J [corrected to Szustakowski, J]. PMID: 11237011.
  20. Venter JC, Adams MD, Myers EW, et al.. The sequence of the human genome. Science. 2001 Feb 16;291(5507):1304-51. doi: 10.1126/science.1058040. Erratum in: Science 2001 Jun 5;292(5523):1838. PMID: 11181995.
  21. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004 Oct 21;431(7011):931-45. doi: 10.1038/nature03001. PMID: 15496913.
  22. McEntyre J, Ostell J, editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2002-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK21101/?depth=2
  23. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015 Jan;43(Database issue):D204-12. doi: 10.1093/nar/gku989. Epub 2014 Oct 27. PMID: 25348405; PMCID: PMC4384041.
  24. wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2019 Jan 8;47(D1):D520-D528. doi: 10.1093/nar/gky949. PMID: 30357364; PMCID: PMC6324056.
  25. Gibson DG, Glass JI, Lartigue C, Noskov VN, Chuang RY, Algire MA, Benders GA, Montague MG, Ma L, Moodie MM, Merryman C, Vashee S, Krishnakumar R, Assad-Garcia N, Andrews-Pfannkoch C, Denisova EA, Young L, Qi ZQ, Segall-Shapiro TH, Calvey CH, Parmar PP, Hutchison CA 3rd, Smith HO, Venter JC. Creation of a bacterial cell controlled by a chemically synthesized genome. Science. 2010 Jul 2;329(5987):52-6. doi: 10.1126/science.1190719. PMID: 20488990.
  26. Jinek M, Chylinski K, Fonfara I, Hauer M, Doudna JA, Charpentier E. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science. 2012 Aug 17;337(6096):816-21. doi: 10.1126/science.1225829. PMID: 22745249; PMCID: PMC6286148.
  27. Aliper A, Plis S, Artemov A, Ulloa A, Mamoshina P, Zhavoronkov A. Deep Learning Applications for Predicting Pharmacological Properties of Drugs and Drug Repurposing Using Transcriptomic Data. Mol Pharm. 2016 Jul 5;13(7):2524-30. doi:10.1021/acs.molpharmaceut.6b00248. PMID: 27200455; PMCID: PMC4965264.
  28. Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet. 2019 Jul;20(7):389-403. doi:10.1038/s41576-019-0122-6. PMID: 30971806.
  29. Issa NT, Stathias V, Schürer S, Dakshanamurthy S. Machine and deep learning approaches for cancer drug repurposing. Semin Cancer Biol. 2021 Jan;68:132-142. doi:10.1016/j.semcancer.2019.12.011. PMID: 31904426; PMCID: PMC7723306.
  30. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589. doi:10.1038/s41586-021-03819-2. PMID: 34265844; PMCID: PMC8371605.
  31. Poirion OB, Jing Z, Chaudhary K, Huang S, Garmire LX. DeepProg: an ensemble of deep-learning and machine-learning models for prognosis prediction using multi-omics data. Genome Med. 2021 Jul 14;13(1):112. doi:10.1186/s13073-021-00930-x. PMID: 34261540; PMCID: PMC8281595.
  32. Tran KA, Kondrashova O, Bradley A, Williams ED, Pearson JV, Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 2021 Sep 27;13(1):152. doi: 10.1186/s13073-021-00968-x. PMID: 34579788; PMCID: PMC8477474.
  33. Lei Y, Li S, Liu Z, Wan F, Tian T, Li S, Zhao D, Zeng J. A deep-learning framework for multi-level peptide-protein interaction prediction. Nat Commun. 2021 Sep 15;12(1):5465. doi: 10.1038/s41467-021-25772-4. PMID: 34526500; PMCID: PMC8443569.
  34. Kang M, Ko E, Mersha TB. A roadmap for multi-omics data integration using deep learning. Brief Bioinform. 2022 Jan 17;23(1):bbab454. doi: 10.1093/bib/bbab454. PMID: 34791014; PMCID: PMC8769688.
  35. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019 Jan;51(1):12-18. doi: 10.1038/s41588-018-0295-5. Epub 2018 Nov 26. PMID: 30478442.
  36. Patel L, Shukla T, Huang X, Ussery DW, Wang S. Machine Learning Methods in Drug Discovery. Molecules. 2020 Nov 12;25(22):5277. doi: 10.3390/molecules25225277. PMID: 33198233; PMCID: PMC7696134.
  37. Berrar D, Dubitzky W. Deep learning in bioinformatics and biomedicine. Brief Bioinform. 2021 Mar 22;22(2):1513-1514. doi: 10.1093/bib/bbab087. PMID: 33693457; PMCID: PMC8485073.